2025.10.16 | UniMoE Unifies Speech and Music; Attention Maps Illuminate LLM Reasoning
The 15 papers in this episode:
[00:21] 🎧 UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE
[00:57] 🔍 Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization
[01:38] ⚡ FlashWorld: High-quality 3D Scene Generation within Seconds
[02:06] 🐝 Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
[02:37] 🗣 InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue
[03:24] 🌍 PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning
[04:00] 🧪 LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models
[04:43] 🚗 CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving
[05:21] 🔍 Generative Universal Verifier as Multimodal Meta-Reasoner
[06:07] ⚖ ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs
[06:43] 🎞 Trace Anything: Representing Any Video in 4D via Trajectory Fields (a single feed-forward pass recovers a continuous space-time path for every pixel)
[07:27] 🌍 Reasoning in Space via Grounding in the World
[07:54] 🧠 The Role of Computing Resources in Publishing Foundation Model Research
[08:28] ⚖ UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning
[09:05] 🤖 InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
[Follow Us]
You can also find us on the following platforms for more content beyond the podcast:
Xiaohongshu: AI速递